| tag_id | tag_name | goodreads_book_id | count |
|---|---|---|---|
| 0 | - | 665 | 3 |
| 0 | - | 81176 | 5 |
| 0 | - | 147074 | 2 |
| 0 | - | 2983489 | 3 |
| 0 | - | 6713071 | 4 |
| 0 | - | 6952423 | 4 |
| 0 | - | 7494887 | 3 |
| 1 | –1- | 43799 | 3 |
| 1 | –1- | 1207904 | 3 |
| 2 | –10- | 513367 | 3 |
Final Project
Final Project
Data Science 1 with R (STAT 301-1)
Final Project
Things to Fix
Q1, Pt 1
If two books have the same average rating, put the book with the higher rating count ahead of the one with the lower rating count.
Filter with 1,000+ ratings. Or mention that this book only contains the 10,000 most popular books on Goodreads, so that’s why this filtering doesn’t matter since it already takes care of that itself.
Q1, Pt 3
- Mention that the average is so high bc these are the 10,000 most popular books on Goodreads.
Notes
- maybe use gg_miss_var
Set Up
An Overview of My Data
What is Goodreads?
Officially, this is how Goodreads describes itself: “Goodreads is the world’s largest site for readers and book recommendations. Our mission is to help readers discover books they love and get more out of reading. Goodreads launched in January 2007.”
What is my data about?
It comes in five separate csv files: “books”, “books_tags”, “ratings”, “tags”, and “tbr”
(For context, users on Goodreads can tag books and add them to their shelves. And “tbr” stands for “to be read”.)
In this document, when I say “Goodreads data”, I am referring to the five datasets generally. For individual datasets, I will use their more specific name.
My Objectives
There are several questions I’m interested in examining in Goodreads data.
What are the most highly rated books?
Do readers who leave the most ratings leave the higher ratings on average or lower ratings on average?
Is there any relationship between the number of times a book appears in the tbr dataset and the number of ratings it received?
What is the relationship between a book’s average rating and the year it was published?
How do users tag the most highly rated books? Is there a trend?
Joining Data
The README in my data source has some notes userful for joining the five Goodreads datasets. I’ve copied and pasted the notes I found most interesting and relevant below. I’ll re-explain the key points when I join my data in the following sections, but I just wanted to put the most interesting notes about the Goodreads data together in the same section:
Ratings
Ratings go from one to five. Both book IDs and user IDs are contiguous. For books, they are 1-10000, for users, 1-53424.
to_read.csv provides IDs of the books marked “to read” by each user, as user_id,book_id pairs, sorted by time. There are close to a million pairs.
books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.). The metadata have been extracted from goodreads XML files, available in books_xml.
Tags
book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs. They are sorted by goodreads_book_id ascending and count descending.
Each tag/shelf is given an ID. tags.csv translates tag IDs to names.
Goodreads IDs
Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.
Note that book_id in ratings.csv and to_read.csv maps to work_id, not to goodreads_book_id, meaning that ratings for different editions are aggregated.
Step 2: Joining books and ratings
Why am I joining books and ratings?
Joining books and ratings will allow me to start answering on three of my research questions: “What are the most highly rated books?”, “What is the relationship between a book’s average rating and the year it was published?” and “Do readers who leave the most ratings leave the higher ratings on average or lower ratings on average?”
Now let’s actually join books and ratings
But first, some prefaces:
The README says:
Both
book_idanduser_idare contiguous in the ratings.csv file.It also contains an important note:
book_idin the ratings dataset and tbr dataset map towork_idin the books dataset (notgoodreads_book_idas we would intuitively assume).
This article linked on the About section of the github where I found the Goodreads dataset says:
“Each book may have many editions.
goodreads_book_idandbest_book_idgenerally point to the most popular edition of a given book, while goodreadswork_idrefers to the book in the abstract sense.”It also again states that “Note that book_id in ratings.csv and to_read.csv (I, Valerie, called this the”tbr” dataset) maps to
work_id, not to goodreads_book_id. It means that ratings for different editions are aggregated.
I read through the README of the dataset, the article linked on the dataset, and the Kaggle which has descriptions for the previous, unupdated version of the github dataset I’m using.
From these three links, I’ve arrived at several conclusions:
Mapping
book_idin the ratings dataset towork_idin the books dataset is only necessary if I care about different editions being aggregated. I don’t. I’m more interested in the most popular edition of a given book. That’s whatgoodreads_book_idtracks, according to the article.And
books_idin the books dataset should be equivalent tobooks_idin the ratings dataset. So I’m going to see if I can join the books and ratings dataset usingbooks_idas the key:
Let’s start by seeing whether there is implicit missingness using antijoins.
There are now rows for book_id within books that don’t have a matching book_id within rating and vice versa. All rows match.
So book_id is the primary key for both datasets and also a foreign key that can be used to join the two datasets.
Just a quick side note because I find this very interesting: If I were to have instead mapped work_id in books with book_id in ratings (which I didn’t because I’m interested in the most popular edition of a given book, not different editions being aggregated). But just hypothetically speaking. Then I would have first needed to condense the ratings into summarized outputs then join that with books using a left join (because there are 9,824 rows where work_id doesn’t match to book_id and 5,846,293 rows where book_id doesn’t match to work_id (there are simply many more work_id rows than book_id rows because, again, multiple book editions. See my R file for more details). But since I’m not doing that because I’m not interested in different editions being aggregated, all I need to do is use book_id to join the datasets books and ratings.
Now that the new books_and_ratings dataset has been created:
Now, for the fun part. An EDA with our two new datasets, books_and_tags and books_and_ratings. And maybe some of the original datasets too.
Question 1: Which are the most highly rated books?
Question 1, Part 1: Which 10 books have the highest average Goodreads rating?
To answer this question, I’m going to look at the books dataset. The variable average_rating can help us figure this out.
10 Books with the Highest Average Goodreads Rating
I’m surprised at some of the ratings of these books, yet very much not surprised at others.
“The Complete Calvin and Hobbes” is more geared toward middle-grade readers, so it’s odd that it has the highest rating.
But I’m not surprised that “Words of Radiance” had the second highest average Goodreads rating. I know that the Goodreads demographic tends to be young adults and book bloggers, most of who all enjoy reading and rating young adult fiction on Goodreads. That’s a category this book falls within.
I definitely expected a Harry Potter book to be one of Goodreads’s highest average books, but not necessarily a boxed set. (Although I should note, the problem with Goodreads is that it considers boxed sets individual books, and there are neither string functions that I can use to filter them from individual books nor tags I can use to filter them, since the naming conventions vary from boxed set to boxed set and closely resemble how individual book titles appear.) But just from being a casual Goodreads user, I’ve observed that boxed sets tend to get higher ratings than individual books of the series. So in a way, it is unsurprising that the trend holds true here.
The “ESV Study Bible” is also something I expected to see on this list, although I expected it to be ranked higher. Today, it’s not 1800s Europe where everyone owns a copy of the Bible, but there are still many people who do and love reading it.
I am surprised that various spinoffs of Calvin and Hobbes and Harry Potter dominate 7 of 10 rankings in the top 10 most highly rated books. I knew they were popular, but not that popular. But I guess Goodreads is also an American company, and those are the types of books Americans read and love.
Question 1, Part 2: Is there a relationship between the rank a book receives based on its number of ratings and its average rating?
I’m curious about whether there’s a relationship between the rank a book receives based on its number of ratings (rank_of_number_ratings) and its average_rating.
Interestingly, there appears to be no relationship between a book’s average rating and its rank based on the number of ratings it received.
And looking at the correlation confirms this: The correlation between rank_of_number_ratings andaverage_rating is almost 0.
| average_rating | rank_of_number_ratings | |
|---|---|---|
| average_rating | 1.0000000 | -0.0863433 |
| rank_of_number_ratings | -0.0863433 | 1.0000000 |
I’m very surprised at this finding. I had expected the rank of the number of ratings a book received to have at least some correlation with the average rating. So I guess people are neither more nor less inclined to rate a book based on whether they hated or loved that book.
Question 1, Part 3: What’s the distribution of book ratings?
Now let’s look at the distribution of book ratings.
Side note: Technically, a histogram is fine. (See R file for an example that I didn’t include here because it’s not that relevant.) But even when I use pretty and break the x-axis into measurements that are easy to read, it’s still not that intuitive to read. Because when I look at a Goodreads rating, I don’t think of the rating scale as continuous. I think of them in bins of 0.25. For example, a book with ≥ 4.5 ratings is excellent. A book with ≥ 4.75 is practically unheard of. And a book with a rating between 4.0 and 4.25 is great. So that’s why I’m going to use cut to create bins that are intuitive to think about.
I’ve made bins of 0.25 for each average rating. n shows how many books there are in each bin:
| average_rating_cut | n |
|---|---|
| (2.25,2.5] | 1 |
| (2.5,2.75] | 1 |
| (2.75,3] | 12 |
| (3,3.25] | 66 |
| (3.25,3.5] | 275 |
| (3.5,3.75] | 1189 |
| (3.75,4] | 3269 |
| (4,4.25] | 3695 |
| (4.25,4.5] | 1363 |
| (4.5,4.75] | 124 |
| (4.75,5] | 5 |
A few things I found interesting about the distribution of the average rating of books on Goodreads:
It’s left skewed and unimodal.
Most books (3695 of them, as seen in the table above this graph) have a rating greater than 4 and ≤ 4.25.
Another 3269 books have a rating between greater than 3.75 and ≤ 4.
1 book has an average rating greater than 2.25 and ≤ 2.5. 1 book has a rating greater than 2.5 and ≤ 2.75.
5 books have an average rating greater than 4.75 and ≤ 5.0.
So clearly, there are very very few books with an average rating of less than 3. And there are very very few books with an average rating of more than 4.75.
That means that people who rate books on Goodreads usually either:
- Like the book enough that they went to Goodreads and rated it decently highly (a 3, 4, or 5).
- Don’t like to rate extremely high or low.
- The number of people who gave books a decently high rating tend to pull up average_rating from the people who rated that book lowly.
Without more data, it’s hard to tell whether we can explain away the cluster around book ratings between 3.75 and 4.25 as one of these suggestions, a combination of these suggestions, or none of these suggestions. It’s still fun to think about though.
Question 2: Do readers who leave the most ratings leave the higher ratings on average or lower ratings on average?
Each user (user_id) can one rating to one book, so that means users can rate multiple books. Grouping by user_id and counting that number tells us the users who rate the most books on Goodreads. count is the number of books the user has rated. If we also keep in mind that this dataset only has the data of about 10k books on Goodreads, rating 200 of them (as the top two book-rating users have done) is quite impressive.
Let’s display the top 10 (of 53,424 total users who rated the top 10,000 rated books on Goodreads) who rated the most books in this Goodreads dataset.
| user_id | count |
|---|---|
| 12874 | 200 |
| 30944 | 200 |
| 12381 | 199 |
| 28158 | 199 |
| 52036 | 199 |
| 6630 | 197 |
| 45554 | 197 |
| 7563 | 196 |
| 9668 | 196 |
| 9806 | 196 |
Now, let’s see what the average ratings of these top book-rating users is.
Here’s an excerpt of our dataset “books_and_ratings”, updated with a new column called average_rating that displays the average ratings each user_id gives of the books in this dataset.